A Characterization of Compound Documents on the Web
Abstract
Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web’s content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are:
1. Compound documents are in general much larger than current HTML documents.
2. For large documents, embedded objects and images make up a large part of the documents’ size.
3. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference.
4. Compression considerably reduces the size of documents in both formats.
Similar resources
RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features
The principal aim of a search engine is to provide results sorted according to the user’s requirements. To achieve this aim, it employs ranking methods to rank web documents based on their significance and relevance to the user query. The novelty of this paper is a user feedback-based ranking algorithm using reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...
Web pages ranking algorithm based on reinforcement learning and user feedback
The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Despite the huge number of results extracted for a user's query, users examine only the first few; therefore, placing the relevant results in the top ranks is of great importance. In this paper, a ranking algorithm based on the reinforcement le...
A survey of scientific output on patients' rights at the international level indexed in the Web of Science database between 2000 and 2014
Introduction: One of the criteria showing the importance of a research area is the scientific products in that research area. The aim of the current study was to investigate the situation of scientific products on the topic of Patients’ rights indexed in ISI-Web of Science between the years 2000 until 2014. Methods: The method used was descriptive-cross sectional with a Scientometrics...
Survey of Iranian gastroenterology and hepatology scientific productions in Web of Science database from 1983 to 2017
Background: One of the most important criteria of the development of countries at the national and international levels is the survey of scientific productions indexed in authentic databases. This study aimed to analyze the scientific productions by Iranian researchers on gastroenterology and hepatology in the Web of Science (WOS) database. Methods: This applied study used a scientometric appr...
An Ensemble Click Model for Web Document Ranking
Annually, web search engine providers spend more and more money on documents ranking in search engines result pages (SERP). Click models provide advantageous information for ranking documents in SERPs through modeling interactions among users and search engines. Here, three modules are employed to create a hybrid click model; the first module is a PGM-based click model, the second module in a d...
Publication date: 1999